Method and system to search objects in published literature for information discovery tasks

ABSTRACT

The present invention relates to the identification, extraction, linking, storage and provisioning of data that constitute the captioned components of published or “print ready” literature for computerized information discovery activities including search, browse and data mining. These components, or objects, include the tabular presentation of data (“tables”) and graphics such as “figures”, “images” and “illustrations” typically used to supplement the textual narrative of the publication.

RELATED APPLICATION

This application claims the benefit of U.S. Provisional Application No. 60/783,459 filed Mar. 17, 2006 entitled “Method and System to Index Captioned Objects in Published Literature for Information Discovery Tasks,” the disclosure of which also is entirely incorporated herein by reference.

BACKGROUND

1. Field

The present invention relates generally to automatic information capture techniques and, more particularly, to the secondary publishing (or abstracting and indexing) industry.

2. Background

Captioned components such as figures and tables represent the distilled essence of research communicated in academic articles. Although the marginalia surrounding these displays of data is useful, researchers are eager to view the actual data collected, observed, or modeled to determine the article's relevance to their work. Raw data sets are usually unavailable, but the processed data displayed in figures and tables are as valuable, or even more so.

The primary objective of a literature search is to find articles containing information most relevant to researchers' interests. Neither traditional article-level indexing provided by standard Abstracting & Indexing (A&I) services, nor full-text indexing whereby all text within a document is indexed, can restrict a result set to only those publications which contain data of interest.

For one reason, many key variables are excluded from traditional A&I searches because, although discretely important, they are generally not reflected in the more general nature of the author's abstract or the article title, traditional grist for the A&I indexing mill. Also, variables can be hidden from full-text searches because critical text within figures and tables is actually part of an image file which is not indexed (and made searchable) in full-text search systems. Web harvesters (e.g. Google) do not distil text from images. Furthermore, variables are ‘diluted’ in full-text indexes because many matches are peripheral; i.e., the variable of interest appears as an indirect reference (e.g. in a literature reference cited within an article). As a result, the identified article may not actually contain a figure or table including that particular variable.

A secondary objective of a literature search has been more intractable, and arguably more valuable. Any variable appearing in a figure or table within an article can be searched and linked to other studies examining the same variable. Traditional A&I services are adequate tools to help answer research questions, but there remains a need for indexing that goes further and covers other information such as, for example, tables and figures. By revealing data links in studies across disciplines, new avenues of research can be illuminated.

SUMMARY

It is understood that other embodiments of the present invention will become readily apparent to those skilled in the art from the following detailed description, wherein it is shown and described only various embodiments of the invention by way of illustration. As will be realized, the invention is capable of other and different embodiments and its several details are capable of modification in various other respects, all without departing from the spirit and scope of the present invention. Accordingly, the drawings and detailed description are to be regarded as illustrative in nature and not as restrictive.

BRIEF DESCRIPTION OF THE DRAWINGS

Various aspects of a system for indexing and locating captioned objects are illustrated by way of example, and not by way of limitation, in the accompanying drawings, wherein:

FIGS. 1A and 1B illustrate an exemplary document having a captioned object along with a detailed view of the captioned object;

FIGS. 2A and 2B illustrate another exemplary document having a captioned object along with a detailed view of that captioned object;

FIG. 2C illustrates an exemplary section of a document referencing a captioned object;

FIG. 3 depicts an exemplary computer system on which an embodiment of the present invention may be implemented;

FIG. 4 depicts a flowchart of an exemplary algorithm of indexing captioned objects according to the principles of the present invention;

FIG. 5 depicts an exemplary extraction rule;

FIG. 6 depicts an exemplary system for extracting, indexing, searching and retrieving captioned objects in accordance with the principles of the present invention;

FIG. 7 illustrates an exemplary extracted object as XML;

FIG. 8 illustrates an exemplary editorial screen for extracting information about captioned objects in accordance with the principles of the present invention;

FIG. 9 graphically depicts an association between related objects and abstracts;

FIG. 10 provides a table that illustrates relationships between objects, attributes, and abstracts that are identifiable according to the principles of the present invention;

FIGS. 11A-11E depict exemplary interface screen shots of a search application involving captioned objects;

FIGS. 12A and 12B depict exemplary interface screen shots of another search application;

FIGS. 13A-13I depict exemplary captioned objects that may be used in different embodiments of the present invention to provide advantages over merely textual abstracting and indexing; and

FIGS. 14A-14E depict exemplary interface screen shots of another search application involving captioned objects, including an enhanced abstract.

DETAILED DESCRIPTION

The detailed description set forth below in connection with the appended drawings is intended as a description of various embodiments of the invention and is not intended to represent the only embodiments in which the invention may be practiced. The detailed description includes specific details for the purpose of providing a thorough understanding of the invention. However, it will be apparent to those skilled in the art that the invention may be practiced without these specific details. In some instances, well known structures and components are shown in block diagram form in order to avoid obscuring the concepts of the invention. In particular, exemplary embodiments are provided below that specifically describe camera-ready or printed documents. Such specifics are for illustrative purposes only and one of ordinary skill will recognize that documents of various, different formats may be used without departing from the scope of the present invention.

Captioned Objects in Published Research

FIG. 1A is an illustration of a print or camera-ready document from which captioned objects may be extracted by embodiments of the present invention. As described herein, a print or camera-ready document is a document which is already in a printed publication, or shortly going to be made available for dissemination via a publication. For the purposes of exposition, and without loss of the wider contexts in which this invention is intended to serve, these documents are assumed to contain scholarly content meant for dissemination to a wider audience of researchers, and will be referred to as “research articles”. The print-ready articles may be associated with a traditional paper-based publication, or be available via an “e journal”. Regardless of the channel in which the articles have been, or will be, disseminated, these research articles contain several distinct components that are recognized in the art. In the abstracting, indexing and search context, these components are commonly referred to in the art as “citation” information (for example, “title”, “author(s)”, “publication”, “volume”, “issue”, “page numbers”) that can uniquely identify the article and its associated publication, an “abstract” (a short section of text that summarizes the document), the “full-text” (the main body of the document) and “cited references” (references to other articles used by the author(s) in the article). An abstract may be provided by the author(s), or an abstract may be written by a third-party such as an abstracting and indexing service, or other secondary publisher.

Within the full-text, the author's exposition may require the provision of information that cannot be concisely conveyed using a textual narrative. This is especially true in the presentation of research studies, where a textual exposition/explanation of numeric data and statistical results may be cumbersome. In these circumstances, authors may present the desired information in the form of distinct components or objects placed within the full-text and make references to these objects in the textual narrative. In the art, these components are commonly referred to as “tables” and “figures”. A table is a row and column presentation of data that may be presented without there being a trend or pattern of relationship between sets of data values. A figure is a visual presentation of results, including graphs, charts, diagrams, photos, drawings, schematics, maps, etc. According to the conventions of written communication, content such as tables and figures are distinct entities in and of themselves and typically contain a caption that consists of a referential label (e.g., “FIG. 1”, “FIG. 4”, etc.) and a description (e.g., “Vitamin E concentrations in fish eggs and muscle tissue” or “The effect of dietary rapeseed oil (a) and dietary vitamin E and copper (b) on Fe²⁺-induced lipid oxidation of pig liver.”). Of particular interest to the present description are these captioned objects or components found in print-ready articles.

According to FIG. 1A, the full-text of an article 100 commences on Page 1 102 (after the title, author and abstract sections) and continues to page 10 104 (which includes the commencement of the citations). The full-text consists of the textual narrative, arranged in two columns, and two captioned objects. Of the visible pages depicted, Pages 2 and 8 contain objects 106, 108 pertaining to one or more embodiments of the present invention.

FIG. 1B depicts an exploded view of one of the objects 108 on page 8. According to the illustration, this object, denoted by the authors as “FIG. 2”, comprises a caption and two line graphs. The line graphs in the object also contain information of interest to researchers in the axes labels, such as the measurement units of the variables depicted. In addition to the labels, there are also various legends associated with the different axes. This valuable information, which is the focus of the present invention, is not captured by indexing or search systems in the prior art.

FIG. 2A illustrates another exemplary full-text article 200 along with FIG. 2B that depicts an exploded view of one of the objects 204 of the article that occurs on Page 3 202. According to the illustration, the object to be identified and extracted is what is described in the art as a ‘table’, which in this specific instance summarizes Vitamin E concentration in fish eggs and muscle tissue data arranged in rows with data elements. FIG. 2C depicts an exploded view 208 of a section of Page 2 206 where the first reference 210 to this object 204 is made by the authors in the full-text of the article, specifically the paragraph beginning with “Vitamin E in Fish Tissues.” Comparing the contents of this paragraph of referential text with that of the captioned object (table), it will be apparent to one skilled in the art that the information content of the table object is far richer than the summary provided by the author within the full-text. For example, specific tissues are detailed in the object (e.g., gonad vs. muscle vs. spleen, etc.) but not in the summary. Moreover, vitamin E concentrations of live and commercial fish feed are displayed in the object, but are absent from the summary.

Hardware Overview

FIG. 3 is a block diagram that illustrates a computer system 300 upon which an embodiment of the invention may be implemented. Computer system 300 includes a bus 302 or other communication mechanism for communicating information, and a processor 304 coupled with bus 302 for processing information. Computer system 300 also includes a main memory 306, such as a random access memory (RAM) or other dynamic storage device, coupled to bus 302 for storing information and instructions to be executed by processor 304. Main memory 306 may also be used for storing temporary variables or other intermediate information during execution of instructions to be executed by processor 304. Computer system 300 further includes a read only memory (ROM) 308 or other static storage device coupled to bus 302 for storing static information and instructions for processor 304. A storage device 310, such as a magnetic disk or optical disk, is provided and coupled to bus 302 for storing information and instructions.

Computer system 300 may be coupled via bus 302 to a display 312, such as a cathode ray tube (CRT), for displaying information to a computer user. An input device 314, including alphanumeric and other keys, is coupled to bus 302 for communicating information and command selections to processor 304. Another type of user input device is cursor control 316, such as a mouse, a trackball, or cursor direction keys for communicating direction information and command selections to processor 304 and for controlling cursor movement on display 312. This input device typically has two degrees of freedom in two axes, a first axis (e.g., x) and a second axis (e.g., y), that allow the device to specify positions in a plane.

Computer system 300 operates in response to processor 304 executing one or more sequences of one or more instructions contained in main memory 306. Such instructions may be read into main memory 306 from another computer-readable medium, such as storage device 310. Execution of the sequences of instructions contained in main memory 306 causes processor 304 to perform the process steps described herein. In alternative embodiments, hard-wired circuitry may be used in place of or in combination with software instructions to implement the invention. Thus, embodiments of the invention are not limited to any specific combination of hardware circuitry and software.

The term “computer-readable medium” as used herein refers to any medium that participates in providing instructions to processor 304 for execution. Such a medium may take many forms, including but not limited to, non-volatile media, volatile media, and transmission media. Non-volatile media includes, for example, optical or magnetic disks, such as storage device 310. Volatile media includes dynamic memory, such as main memory 306. Transmission media includes coaxial cables, copper wire and fiber optics, including the wires that comprise bus 302. Transmission media can also take the form of acoustic or light waves, such as those generated during radio-wave and infra-red data communications.

Common forms of computer-readable media include, for example, a floppy disk, a flexible disk, hard disk, magnetic tape, or any other magnetic medium, a CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave as described hereinafter, or any other medium from which a computer can read.

Various forms of computer readable media may be involved in carrying one or more sequences of one or more instructions to processor 304 for execution. For example, the instructions may initially be carried on a magnetic disk of a remote computer. The remote computer can load the instructions into its dynamic memory and send the instructions over a telephone line using a modem. A modem local to computer system 300 can receive the data on the telephone line and use an infra-red transmitter to convert the data to an infra-red signal. An infra-red detector can receive the data carried in the infra-red signal and appropriate circuitry can place the data on bus 302. Bus 302 carries the data to main memory 306, from which processor 304 retrieves and executes the instructions. The instructions received by main memory 306 may optionally be stored on storage device 310 either before or after execution by processor 304.

Computer system 300 also includes a communication interface 318 coupled to bus 302. Communication interface 318 provides a two-way data communication coupling to a network link 320 that is connected to a local network 322. For example, communication interface 318 may be an integrated services digital network (ISDN) card or a modem to provide a data communication connection to a corresponding type of telephone line. As another example, communication interface 318 may be a local area network (LAN) card to provide a data communication connection to a compatible LAN. Wireless links may also be implemented. In any such implementation, communication interface 318 sends and receives electrical, electromagnetic or optical signals that carry digital data streams representing various types of information.

Network link 320 typically provides data communication through one or more networks to other data devices. For example, network link 320 may provide a connection through local network 322 to a host computer 324 or to data equipment operated by an Internet Service Provider (ISP) 326. ISP 326 in turn provides data communication services through the world wide packet data communication network now commonly referred to as the “Internet” 328. Local network 322 and Internet 328 both use electrical, electromagnetic or optical signals that carry digital data streams. The signals through the various networks and the signals on network link 320 and through communication interface 318, which carry the digital data to and from computer system 300, are exemplary forms of carrier waves transporting the information.

Computer system 300 can send messages and receive data, including program code, through the network(s), network link 320 and communication interface 318. In the Internet example, a server 330 might transmit a requested code for an application program through Internet 328, ISP 326, local network 322 and communication interface 318. The received code may be executed by processor 304 as it is received, and/or stored in storage device 310, or other non-volatile storage for later execution. In this manner, computer system 300 may obtain application code in the form of a carrier wave.

Thus, two or more computers may be used to provide the full functionality of the present invention using networked or connected computer systems. For example, the input and output devices used by a computer user to communicate instructions and view information may be located on another computer system. When the two computer systems are connected via the Internet, a computer user on the other computer system may view output in a local web-browser and can communicate instructions to the computer application on computer system 300 using a local input device such as the user's keyboard. The user's instructions are transmitted through the network, received by communication interface 318 and transferred to processor 304 internally via bus 302.

Thus, embodiments of the present invention may be implemented as one or more modules, routines, or applications that are executed by the computer system of FIG. 3. One of ordinary skill will recognize that the software, regardless of its specific structure, may be stored on a variety of different media and, when executed, causes the computer platform to operate as programmed.

Extracting, Linking, Indexing and Storing Captioned Objects

FIG. 4 is a flow chart illustrating the steps performed in extracting, linking, indexing and storing an object record for information discovery tasks according to an embodiment of the present invention, starting with step S410. At step S415, a print-ready article is loaded and readied for extraction. This step may include the retrieval of a batch of full-text articles from a publisher and splitting the batch into individual articles or full-text components. Alternatively, this step may include using a ‘crawler’ to fetch components of a full-text article and storing the components locally. This technique may be applied to full-text articles that are available in a mark-up language such as HTML that supports embedded resource links.

At step S420, extraction rules are applied to the full-text record. The extraction rules specify the type of captioned components to be identified and extracted, as well as the attributes and optionally attribute values that need to be extracted. According to a preferred embodiment, the extraction rules are specified for all captioned objects in the full-text. Generally speaking, however, the objects to be extracted and their attributes are dictated by externally defined business requirements such as the intended information discovery use that the extracted objects are to serve, or even the intended audience. For example, the construction of a “map image” database may require only maps and their attributes be extracted from the full-text record. Likewise, the extraction rules may be specific to a particular publisher, journal, or file format (e.g., PDF vs. HTML vs. XML), or to a combination of these factors. The extraction rules may also specify attributes associated with the full-text of the article to be captured. According to a preferred embodiment, one such full-text attribute is the “Reference Text” such as 210, which is the fragment of the full-text that contains the reference to the to-be-extracted object. In another embodiment, the sequence of objects as they occur within the full-text is collected.
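By way of illustration only, the capture of “Reference Text” may be sketched in Python as follows. The label pattern, the paragraph-level granularity of a fragment, and the names `LABEL`, `normalize` and `reference_texts` are assumptions made for this sketch and are not taken from the drawings.

```python
import re

# Assumed label pattern: matches "Table 1", "Figure 2", "Fig. 2" or "FIG. 2".
LABEL = re.compile(r"\b(Table|Figure|Fig\.)\s*(\d+)", re.IGNORECASE)


def normalize(kind, number):
    """Map label variants such as 'FIG.' or 'Fig.' onto one canonical form."""
    canonical = "Table" if kind.lower().startswith("t") else "Figure"
    return f"{canonical} {int(number)}"


def reference_texts(full_text):
    """Return {object label: [paragraphs of the full-text that mention it]}.

    Paragraphs are assumed to be separated by blank lines. Keys appear in
    order of first mention in the full-text.
    """
    refs = {}
    for paragraph in full_text.split("\n\n"):
        fragment = paragraph.strip()
        for kind, number in LABEL.findall(fragment):
            fragments = refs.setdefault(normalize(kind, number), [])
            if fragment not in fragments:
                fragments.append(fragment)
    return refs
```

A production rule set would likely need publisher-specific label patterns (for example, “Plate”, “Scheme” or “Map”), which is consistent with the rules being specific to a publisher, journal or file format.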

The extraction rules may also specify how the identified objects are to be labeled or tagged for future reference within the system. Assignment of “object ids” is advantageous, since the object id is typically the key which is used to store and retrieve the object record from the database repository.

Step S425 is a decision point where the success of the extraction is evaluated. Generally speaking, this step is a quality control point that prevents problems in extraction cascading ‘downstream’. For example, an error condition may be flagged if the full-text makes reference to ‘Table 6’ and the extraction routine does not identify this object. A failure condition (‘No’) leads to extraction error handling Step S460. At Step S460, the cause of the failure is identified. Fixable failures such as those stemming from data format changes (e.g., a change in the XML schema) are reprocessed through Step S415, whereas corrupt or malformed records follow the Reject Step S465. The rejection step may include communicating the identified rejected record and the reason for rejection back to the primary provider and submission of a request for a resubmission of the record.

The success condition at Step S425 may be based on deterministic rules or may be according to probabilistic success thresholds for the extracted objects and the list of attributes specified for extraction. The error condition described previously is an example of a deterministic rule. An example of a probabilistic success threshold relates to object extraction from an image file of the full-text. In this instance, locating the span of the object within the image file may be performed with a degree of certainty that does not fall within acceptable success thresholds.
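A minimal sketch of such a success check, combining one deterministic rule with one probabilistic threshold, might look as follows. The `confidence` score, the default threshold of 0.8 and all names used are hypothetical and serve only to illustrate the two kinds of condition.

```python
from dataclasses import dataclass


@dataclass
class ExtractedObject:
    label: str         # e.g. "Table 6"
    confidence: float  # hypothetical 0..1 certainty of the located object span


def check_extraction(referenced_labels, objects, min_confidence=0.8):
    """Return a list of problems; an empty list means the check passed."""
    problems = []
    found = {obj.label for obj in objects}
    # Deterministic rule: every label referenced in the full-text must have
    # a corresponding extracted object.
    for label in sorted(set(referenced_labels) - found):
        problems.append(f"missing object: {label}")
    # Probabilistic rule: the span of each object must have been located
    # with at least the configured degree of certainty.
    for obj in objects:
        if obj.confidence < min_confidence:
            problems.append(f"low confidence ({obj.confidence:.2f}): {obj.label}")
    return problems
```

A non-empty result would correspond to the ‘No’ branch of the decision point, with the listed problems passed on to error handling.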

Step S430 is a collation step where a number of different records, often from disparate sources, have to be readied prior to linking. According to one embodiment of the present invention, these records that need to be ready and accessible may include the ‘Abstract’ record and the source (or publication/publisher information) record.

Step S435 links the extracted object records to the corresponding abstract and source records. At the completion of this step, each extracted object record may be associated with an abstract record, the original full-text record and the source record from which the object was extracted. The source record may contain information about the article's access rights and the time when access may be granted to the public. At this linking step, these source-based attributes are associated with, or transferred to, the object record. The source attributes may include access rights which may differ by publisher. In other words, extracted objects from one publisher may have the same access rights as the full-text records, whereas objects from another publisher may have access rights that differ from those of the full-text records from that publisher.
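The transfer of source-based attributes at the linking step may be sketched as follows. The record fields, the per-publisher override table and the function name `link` are illustrative assumptions; the description does not prescribe a record layout.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SourceRecord:
    source_id: str
    publisher: str
    access_rights: str  # rights that apply to the full-text record
    public_from: str    # date from which public access may be granted


@dataclass
class ObjectRecord:
    object_id: str
    article_id: str
    abstract_id: Optional[str] = None
    source_id: Optional[str] = None
    access_rights: Optional[str] = None
    public_from: Optional[str] = None


def link(obj, abstracts_by_article, sources_by_article, object_rights_by_publisher=None):
    """Attach abstract and source information to one extracted object record.

    A KeyError signals a linking failure (no abstract or source was collated
    for the article). `object_rights_by_publisher` is a hypothetical table for
    publishers whose objects carry rights that differ from the full-text.
    """
    overrides = object_rights_by_publisher or {}
    source = sources_by_article[obj.article_id]
    obj.abstract_id = abstracts_by_article[obj.article_id]
    obj.source_id = source.source_id
    obj.access_rights = overrides.get(source.publisher, source.access_rights)
    obj.public_from = source.public_from
    return obj
```

Raising on a missing abstract or source keeps the failure visible to a subsequent quality control step rather than storing a partially linked record.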

Step S440 is a quality control decision point, similar to S425, where the outcome of the linking step S435 is evaluated. The error handling step S470 determines the cause of the linking failure and may result in a reprocessing of the linking step, or an outright rejection of the object records.

Indexing step S445 follows a successful linking operation. In general, this step constitutes the editorial functions comprising the steps of: validation of extraction and linking steps, assignment of search/browse attribute values, assignment of subject specific descriptors, and authority control tasks such as spelling and name normalization. Step S447 is the final decision point, where the fully created object record, its attributes and assigned attributes are verified to be suitable for addition to the objects repository. Records that do not meet the passing conditions are rejected and may be attached to appropriate error resolution processes after which the record may be re-inserted at the appropriate process point described previously.

At Step S450, the fully constructed object record is stored in an objects data repository from where it may be packaged or repurposed for specific information discovery tasks including retrospective searching, alerting systems and browsing. The nature of the associations created within the object record, amongst object records and between the objects, abstracts and full-text are discussed in detail below. In general, objects may be associated with each other according to the existence of a specific attribute (e.g., “Figure”) or specific attribute value (Image type=“Map”) that is identified by extraction step S420 or assigned at indexing step S445. Specified attributes may be multiply occurring. For example, the attribute INDEX TERM may contain the two values “Sediment Slurries” and “Salinity”. Furthermore, objects may be bi-directionally linked to the corresponding abstract record and full-text record. The bi-directional linkages facilitate retrieval modalities using both the full-text/abstract as the “base” and the indexed objects themselves. In other words, a search and retrieval system may be designed to allow users to search for full-texts and/or abstracts and then communicate the object records associated with each retrieved full-text or abstract record. Alternatively, the search system may allow a user to search or browse a repository of objects and then find or view the associated abstract or full-text records.
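As a non-limiting sketch, the bi-directional linkage and the multiply occurring attributes may be modeled with a small in-memory repository. The class `ObjectRepository` and its method names are hypothetical; an actual embodiment would more likely rely on a database.

```python
from collections import defaultdict


class ObjectRepository:
    """Illustrative in-memory stand-in for an objects data repository."""

    def __init__(self):
        self._objects = {}                      # object id -> stored record
        self._by_article = defaultdict(list)    # article id -> object ids
        self._by_attribute = defaultdict(set)   # (name, value) -> object ids

    def add(self, object_id, article_id, attributes):
        """Store an object; `attributes` maps a name to a list of values."""
        self._objects[object_id] = {"article_id": article_id, "attributes": attributes}
        self._by_article[article_id].append(object_id)
        for name, values in attributes.items():
            for value in values:  # attributes may be multiply occurring
                self._by_attribute[(name, value)].add(object_id)

    def objects_for_article(self, article_id):
        """Full-text/abstract as the base: list the objects of an article."""
        return list(self._by_article.get(article_id, []))

    def article_for_object(self, object_id):
        """Object as the base: find the article it was extracted from."""
        return self._objects[object_id]["article_id"]

    def find(self, name, value):
        """Associate objects that share a specific attribute value."""
        return sorted(self._by_attribute.get((name, value), set()))
```

With such a structure, a query on INDEX TERM “Salinity” returns objects from any article, and each returned object can be traced back to its abstract or full-text record.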

While the foregoing discussion specifies a method of indexing a set of objects from a single full-text article, it must be appreciated that in a production operation, an objects extraction system must be designed to address issues of scale and be readily deployed to leverage existing A&I work-flows and data flows that are not “objects” focused, but rather full-text and abstracts focused.

Objects Content Processing System

FIG. 6 is a block diagram of a scalable content processing system 600 that may be implemented on computer system 300 for objects extraction, linking, indexing and storage to support an objects-enhanced search/browse service 680 that, in conjunction with a user-interface, facilitates the matching of user queries against a stored index, displays search results and retrieves documents or document components for display to the user. For the purposes of exposition, and without loss of the full inventive nature of the specified method, this section will make references to full-text article 100 and full-text article 200 which may be articles from which objects may be extracted using the method described in FIG. 4.

Object Loader 610 is the input sub-system of objects content processing system 600 and is designed to retrieve or accept disparate full-text sources or ‘feeds’ and create a standardized output for Object Extractor 620. The Object Loader may in turn comprise one or more interfaces 612, 614, 616, 618 where each interface handles a specific type of full-text feed.

According to a preferred embodiment, a software interface is created based on the electronic media format or “content type” that print-ready documents are received in. According to the illustration depicted, HTML interface 612 accepts full-text feeds from full-text content repositories that are stored in HTML format. XML interface 614 processes print-ready records which are available in XML format, PDF interface 616 processes print-ready records available in PDF (Portable Document Format), and so on. According to the illustration depicted, print-ready article 100 is supplied to the content processing system as an XML document while print-ready article 200 is supplied in PDF format.

In another embodiment, interfaces may be designed per primary publisher or, in another embodiment, per publisher/media type combination. This componentized approach allows the addition of new interfaces to support new media formats without requiring major modifications to other components of the content processing system 600. For example, the addition of print-ready documents supplied in a proprietary typesetting media format merely requires the creation of a new interface that may be attached to Object Loader 610.

Each content type interface may contain one or more software packages that are required to perform the extraction of objects from that specific content type. For the HTML interface an HTML parser may be employed. Similarly, for XML documents an XML parser and a style-sheet processor may be readied and used. PDF documents may require a PDF reader that extracts text and identifies the location of objects in the file. For scanned or bit-mapped documents (e.g., TIFF files) an OCR (Optical Character Recognition) package may be used to recognize and extract both text and images.
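One possible way to realize the componentized interfaces is a registry keyed by content type, sketched below using only parsers from the Python standard library. The registry `INTERFACES`, the loader functions, and the XML element name `graphic` with its `href` attribute are assumptions made for illustration; PDF and OCR interfaces would require third-party packages and are omitted.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class _ImageCollector(HTMLParser):
    """Collects the src attribute of every <img> tag in an HTML feed."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def load_html(payload):
    parser = _ImageCollector()
    parser.feed(payload)
    return parser.sources


def load_xml(payload):
    # Hypothetical schema: images are <graphic href="..."/> elements.
    root = ET.fromstring(payload)
    return [el.get("href") for el in root.iter("graphic") if el.get("href")]


# Registry of content type interfaces; supporting a new media format only
# requires registering one more entry here.
INTERFACES = {"html": load_html, "xml": load_xml}


def load(content_type, payload):
    """Dispatch a full-text feed to the interface for its content type."""
    try:
        handler = INTERFACES[content_type.lower()]
    except KeyError:
        raise ValueError(f"no interface registered for {content_type!r}") from None
    return handler(payload)
```

Each handler returns the same standardized output (here, a list of image references), so downstream components need not know which interface produced it.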

Object extractor module 620 processes a print-ready article according to the specific extraction rules 625 specified for the media-type and/or content source.

FIG. 5 is an illustration of an extraction rules configuration that may be applied to a specific document (or, set of documents). This illustration relates to extraction from PDF source documents. The depicted configuration is evaluated by extraction step S420 (see flowchart of FIG. 4) prior to the actual processing of the document. Stepping through the configuration, the first extraction rule specifies that only objects that are ‘Figures’ are to be extracted. In other words, if a table is encountered in the extraction process, it will be ignored. The configuration next specifies that the caption text for the specified objects (in this case, figure objects) is to be identified and extracted, as well as the size of the object. The extraction rules further specify that in-text references and their page numbers are to be captured. The final rule specifies that the captured object need not be passed on for OCR recognition because extraction of other information from the object is to be performed manually, or due to other business specifications.
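The configuration just described may be mirrored, purely as a sketch, by a small immutable settings object. The field names below are hypothetical and do not reproduce the syntax shown in FIG. 5; only the rule values follow the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractionRules:
    """Assumed in-memory form of an extraction rules configuration."""
    source_format: str = "PDF"
    object_types: frozenset = frozenset({"Figure"})  # tables are ignored
    extract_caption: bool = True
    extract_size: bool = True
    extract_in_text_references: bool = True
    extract_reference_pages: bool = True
    send_to_ocr: bool = False  # other information is captured manually


def should_extract(rules, object_type):
    """Decide whether an encountered object falls under the rules."""
    return object_type in rules.object_types
```

A rule set for a “map image” database could, under the same assumptions, be expressed as `ExtractionRules(object_types=frozenset({"Map"}))`.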

Object Loader 610 and Object Extractor 620 sub-systems may be controlled by a Scheduler supervisory system 627 that performs scheduled invocations of these sub-systems according to pre-configured business and/or operational rules. Periodicity of publisher updates is one such business rule. For example, Publisher A may make print-ready articles available on a monthly basis, whereas Publisher B may provide this content on a bi-monthly basis. Alternatively, an electronic journal may provide newly published articles on a daily basis. In similar fashion, on the operational side, Scheduler sub-system 627 may be configured to remove, compress or archive previously processed print-ready feeds.

FIG. 7 is an illustration of the output of Object Extractor 620 for a single object within a print-ready article that may be processed by the objects content processing system. According to one embodiment, the format of the output may be specified in extraction rules repository 625. According to the illustration depicted, this output format configuration parameter has been set to XML and includes a number of predetermined attributes for which values will be extracted. According to another embodiment, this output may be in plain ASCII format. In another embodiment, file-based output may be deactivated altogether in favor of a computationally efficient in-memory data-structure or software object. Additionally, the output rules may specify additional transformations to the extracted data based on requirements of display services 685. For example, uniform size thumbnail images of extracted images may be generated for display to the user. Similarly, extracted tables from documents in HTML format may be converted to images (e.g., JPEG or GIF) for uniformity in display size based on the limitation of output screen area size in the user interface.

The illustrated XML 700 encapsulates the specified attributes and attribute values for a specific content source. These information components include an in-article object reference (“FIG. 2”) 702, the type of object extracted (“Figure”) 704, the source 706 from which the object was extracted (“PLoS_V_3_I_12_DOI_30426_15457885_Document.xml”), the caption of the object extracted, the source file reference of the object, its size and file-type and the references to this object within the textual narrative (in-text reference), including the physical page location where the object is referred to in the textual narrative. According to the illustration depicted, there are two such in-text references that occur on page 1 of the print-ready article.
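
As a rough sketch of how such an output record could be serialized, the following Python fragment builds an XML element for one extracted object. The element and attribute names (`object`, `reference`, `inTextReference`, and so on) and all sample values are assumptions made for illustration; the actual schema of XML 700 is defined by FIG. 7 and is not reproduced here.

```python
import xml.etree.ElementTree as ET


def build_object_record(ref, obj_type, source, caption, in_text_refs):
    """Serialize one extracted object; element names are hypothetical."""
    root = ET.Element("object")
    ET.SubElement(root, "reference").text = ref
    ET.SubElement(root, "type").text = obj_type
    ET.SubElement(root, "source").text = source
    ET.SubElement(root, "caption").text = caption
    refs = ET.SubElement(root, "inTextReferences")
    for page, text in in_text_refs:
        # Each in-text reference carries the page on which it occurs.
        item = ET.SubElement(refs, "inTextReference", page=str(page))
        item.text = text
    return root


record = build_object_record(
    ref="FIG. 2",
    obj_type="Figure",
    source="example_document.xml",       # placeholder file name
    caption="Placeholder caption text",  # placeholder caption
    in_text_refs=[(1, "first mention"), (1, "second mention")],
)
xml_text = ET.tostring(record, encoding="unicode")
```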

Editorial System 630 supports the objects indexing activities of step S445. The editorial system may be connected to an Abstract Loader sub-system 635 with which traditional abstract records 636 may be loaded into the abstracts repository 633. In addition, the editorial system may contain a publications database repository 638 which serves as a centralized or authoritative source of publication and publisher information. Editorial Indexing sub-system 650 provides editorial work-flow functionality by way of a user-interface, utility tools and software for editors to interact with the contents of the data repositories and perform editorial value-add tasks. These tasks include the assignment of domain-specific descriptors, synonyms, normalization of spellings, standardization of record attributes such as author names, citation information, etc., for which a knowledge base repository 652 may be used. In addition, machine-aided indexing software (MAI) 655 processes may be applied to facilitate, supplement or replace the human effort involved in the indexing process. When MAI is used in a supplemental role, the software processes input records and, using configured rule-bases, selects a set of suitable descriptor or index terms for approval by human editors. In a fully automated configuration, the MAI software assigns index terms without the human review step.

The editorial system and the repositories described minimize data duplication of abstract records. For example, when the contents of an abstract are appropriate for two disciplines (e.g., “Biophysics” and “Geological Sciences”), and presumably to be made available for search/browse according to these subject categories, a single abstract record may contain assigned descriptors from both subject areas. This preferred approach is contrasted to one where the abstract record is duplicated, one for every subject area for which descriptor terms need to be assigned. The advantage of the data minimization approach is to be appreciated in the context of indexing objects where within a single article, multiple objects are available for extraction and indexing, and where each extracted object may be indexed for multiple subject areas. Clearly, the duplication approach would have detrimental implications for scaling any objects indexing operation.

Editorial System 630 addresses another operational reality, viz., the asynchronous availability of abstract records and object records (extracted from the print-ready article). Operational factors apart, this situation is the result of established publisher practices where abstracts are typically made available before the full-text and/or print-ready articles. When newly extracted objects are received into Objects Records repository 632, Object/Abstract Linker 640 programmatically verifies the availability of the associated abstract record in abstracts repository 633. Attributes from the Publications Database 638 may also be associated or linked via a database key with the objects and abstract records. Furthermore, the linker assigns unique identifiers to the objects to facilitate search and browse activities that are supplied to end-users by search services 680.
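
The linking step can be pictured with a short Python sketch. Everything here is assumed for illustration: the `link_objects` helper, the `article_id` join key, the `OBJ-` identifier format, and the choice to hold unmatched objects in a pending list until their abstract arrives are not specified by the text above.

```python
import itertools


def link_objects(object_records, abstracts_by_article, id_counter):
    """Attach each object to its abstract record, if one is available."""
    linked, pending = [], []
    for obj in object_records:
        abstract = abstracts_by_article.get(obj["article_id"])
        if abstract is None:
            # Assumption: objects wait until the abstract record is loaded.
            pending.append(obj)
            continue
        linked.append({
            **obj,
            "object_id": f"OBJ-{next(id_counter):06d}",  # unique identifier
            "abstract_id": abstract["abstract_id"],
        })
    return linked, pending


# Made-up records: only article "art-1" already has an abstract.
abstracts = {"art-1": {"abstract_id": "A1"}}
objects = [
    {"article_id": "art-1", "caption": "caption one"},
    {"article_id": "art-2", "caption": "caption two"},
]
linked, pending = link_objects(objects, abstracts, itertools.count(1))
```

Re-running the same function over the pending list after the next abstract load would be one simple way to handle the asynchronous arrival described above.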

According to a preferred embodiment of the present invention, the Object/Abstract Linker 640 processes objects in batch mode and signals editorial indexing process 650 when a new set of objects is ready for indexing. According to another embodiment the linker may be attached first to MAI software 655 which in turn signals the availability of objects for indexing. In yet another embodiment when the publisher feeds are completely synchronized, the object/abstract linker may be configured to run in real-time.

Object Validation and Descriptor Assignment Sub-System

FIG. 8 is an illustration of a user-interface 800 that may be provided by Editorial Indexing sub-system 650 in accordance with one embodiment of the present invention.

According to the illustration depicted, the user-interface provides an ‘Object Data’ tab 810 where the captured object and its automatically extracted attributes are displayed as well as input areas for editorial corrections and descriptor assignment based on editorial rules or policies. Output display area 815 presents the image of the extracted object, and display areas 820 and 825 display the extracted caption and full-text reference, respectively. Input area 830 comprises a set of input widgets for the human editor to assign specific attribute values to the extracted object. These widgets may consist of textboxes, checkboxes, radio buttons and drop-down selection lists. When the object extraction system is configured to extract descriptor terms automatically, or if the extraction process is integrated with a Machine Aided Indexing (MAI) sub-system 655, the user interface may present pre-selected attribute values for review to the editor. According to the illustration depicted the value of ‘Scatter Plot’ for the attribute ‘Category’ may have been automatically determined and the editorial system may be configured to have this value selected by default, thereby minimizing the input time. The input selections may also be presented by way of pick-lists when multiple attribute values have been automatically extracted. For example, the extraction rules for the attribute ‘Geographic Terms’ may result in the identification of multiple geographic areas. Furthermore, when probabilistic extraction rules are employed, a multiple selection pick-list may display attribute values above a pre-configured threshold.
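
A thresholded pick-list of this kind could be produced along the following lines. The `pick_list` helper, the confidence scores and the 0.5 threshold are illustrative assumptions; the text does not state how scores are computed or what threshold is configured.

```python
def pick_list(scored_values, threshold=0.5):
    """Return candidate values scoring above the threshold, best first."""
    kept = [(value, score) for value, score in scored_values.items()
            if score > threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [value for value, _ in kept]


# Made-up scores for the 'Geographic Terms' attribute of one object.
candidates = {"Pacific Ocean": 0.92, "Galapagos": 0.81, "Peru": 0.34}
options = pick_list(candidates, threshold=0.5)
```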

The editorial indexing step supports the requirement that a single object may be subject to the assignment of multiple sets of attribute values. For example, an object being indexed for two disparate subject areas may require entirely different values for a common attribute such as “Descriptor”. In this scenario, a graph object detailing the salt concentration in different lakes may require the assignment of the descriptor value “Salinity” for a technical subject area, but the value “Saltiness” for inclusion in a non-technical database. Less common, but also supported, is the ability to assign different sets of attributes (and therefore attribute values) to a single extracted object.

The editorial interface 800 may also contain additional access points to other attributes of the extracted object. According to the embodiment depicted, the ‘Administrative’ tab provides access to key information about the associated ‘linked’ abstract record and/or full-text record. These data elements may include citation and location information. Furthermore, the location information may be displayed within the user interface as hyperlinks that, upon user selection, present the associated abstract or full-text to the user for visual inspection.

Abstract/Object Output Generator 660 performs Store Object step S450 in which the extracted and indexed object records are stored into Search/Browse indexes 670 that may be used by a Search/Browse service 680 to facilitate the search and retrieval of stored objects. Additionally, the output generator may place processed full-text and object image data into Image Repository 677 to support Display Services 685. The Output Generator's rule-base 665 supplies both business and technology rules for the extraction and storage of objects. The business rules may include periodicity of extraction, types of objects to be extracted (e.g., by publisher, by object type, etc.) and the nature of full-text and full-text image linkage. The technology rules may comprise the desired output format to support a specific search engine, destination file system locations, update/replace rules and so on. Store Object step S450 may comprise additional steps for the display of the images of the objects. For example, a uniform sized thumbnail image may be created from the originally extracted image. In like manner, an image of an object may be stored in a standard image format. In a preferred embodiment, the standard format is JPEG. In cases where the original image format is not JPEG (e.g., GIF), the object's image may be sent to an image conversion software utility that creates a JPEG equivalent. A further processing step relates to the preservation of the publisher copyright at the individual object level. For this, a ‘watermarking’ software application may be applied to the images of the extracted objects whereby the copyright text is overlaid onto the extracted object.

According to another embodiment of the present invention, Abstract/Object Output Generator 660 may be configured to output ‘object bundles’ (pre-specified sub-sets of objects and attributes) that may be used as ‘feeds’ to external systems and applications. For example, the extracted objects and the value-added attributes may be re-supplied back to the primary publisher as an XML feed. Alternatively, a manifest of abstracts, objects and citation information for a specific research area may be extracted and made available for download and use at a researcher's workstation. Further, these object bundles may contain security attributes for their electronic transmission or copyright attributes for which additional software applications, such as the watermarking application described, may be employed.

Associating Objects Records with Abstracts/Full-Text for Search/Browse

According to one embodiment of the present invention, Search/Browse Services sub-system 680 facilitates the objects-enhanced searching of conventional abstract and full-text indexes as well as search/browse of objects, independent of their association with the abstract (or full-text).

FIG. 9 is a diagram that illustrates the associations created by the content processing system and stored in Search/Browse indexes 670 that may be used by search/browse services 680. According to the illustration, Search/Browse index 670 contains two full-text records and their corresponding abstract records. For the purposes of simplified exposition, Full-text Record1 (“FT1”, with associated abstract record “A1”) is (assumed and) depicted as containing two objects (“O1”, “O2”) while Full-text Record2 (“FT2”, with associated abstract record “A2”) is depicted as containing one object (“O3”). Furthermore, in accordance with indexing step S445, each object may contain assigned or identified attributes OA1 . . . OA4 each with assigned attribute values that may be multiply occurring. In the illustration, object attribute OA1 is singly occurring (O1→“V1”, O2→“V2” and O3→“V2”) while object attribute OA2 is multiply occurring (Object Record1 contains values “W1” and “W2” for this attribute).

The thin arrow lines depict the links or indexes that facilitate searches across objects and abstracts (and their associated full-text). With these constructed links, a traditional search of abstract attributes (e.g., “descriptors”) will retrieve abstract records that meet the specified search criteria, and the results will additionally contain information about objects associated with each abstract in the result set. If the search returns abstract A1, then the associated objects O1 and O2 may be accessed by traversing the links (for example, in order to display thumbnail images of these objects). Similarly, the results of a search of the objects attributes will contain information that may be used to link back to the associated abstract record, or full-text record.

The thick arrow lines depict the links that facilitate an “objects only” search or browse modality, one that is independent of the abstract or full-text records from which the objects were constructed. For example, a computer user may want to find all objects that are of type “Figure” and which contain “vitamin E” as an assigned descriptor. Creating these associations in the Search/Browse Index 670 according to the method described enables novel searching and browsing capabilities beyond those offered in the art.
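
An “objects only” search such as the “Figure” plus “vitamin E” example can be sketched with a small inverted index. The index layout, the `build_index` and `search_objects` helpers, and the sample object records are assumptions for illustration and do not describe the actual structure of Search/Browse Index 670.

```python
from collections import defaultdict


def build_index(objects):
    """Map (attribute, value) pairs to the set of object ids carrying them."""
    index = defaultdict(set)
    for obj_id, attrs in objects.items():
        for attribute, values in attrs.items():
            for value in values:
                index[(attribute, value.lower())].add(obj_id)
    return index


def search_objects(index, **criteria):
    """Return ids of objects that satisfy every attribute=value criterion."""
    result = None
    for attribute, value in criteria.items():
        matches = index.get((attribute, value.lower()), set())
        result = set(matches) if result is None else result & matches
    return result or set()


# Made-up object records; attributes may be singly or multiply occurring.
objects = {
    "O1": {"type": ["Figure"], "descriptor": ["vitamin E", "antioxidants"]},
    "O2": {"type": ["Table"], "descriptor": ["vitamin E"]},
    "O3": {"type": ["Figure"], "descriptor": ["salinity"]},
}
hits = search_objects(build_index(objects), type="Figure", descriptor="vitamin E")
```

Because the lookup never touches abstract or full-text records, the search stays independent of them, which is the property the thick arrow lines are meant to convey.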

For the purpose of exposition, FIG. 10 is an illustration of the indexing of object attributes and attribute values according to an embodiment of the present invention described above. According to the illustration, there are four (extracted and/or assigned) attributes 1002, 1004, 1006, 1008: “Type”, “Geography”, “Predictive Model” and “Descriptor”. These attributes may be singly occurring, multiply occurring, or be binary (yes or no). For example, the object “Type” attribute 1002 illustrates a singly occurring attribute, while “Geography” 1004 and “Descriptor” 1008 may be multiply occurring. The “Predictive Model” 1006 attribute is an illustration of an attribute that may be binary in nature whereby its value may be one of ‘true’ or ‘false’. Using this limited set of attributes and their values, the table 1000 additionally illustrates the occurrence of these attributes and/or attribute values in the three representative objects Object1 1010, Object2 1012 (both of which are associated with Abstract/Full-text1 1020) and Object3 1014 (associated with Abstract/Full-text2 1022). Even this simple illustration reveals the advantages of indexing objects in the manner described. For example, the table 1000 illustrates a link between Object1 1010 and Object3 1014 based on the ‘Salinity’ attribute value 1016 of the “Descriptor” 1008 attribute. Since Object1 1010 is associated with Abstract/Full-text1 1020 and Object3 1014 is associated with Abstract/Full-text2 1022, there is now an implicit link between Abstract/Full-text1 1020 and Abstract/Full-text2 1022 which may not have existed without the inclusion of objects data.
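
The implicit link described for FIG. 10 can be derived mechanically, as in the following sketch. The `implicit_links` helper is hypothetical, and the descriptor shown for Object2 is a placeholder, since the text above does not state its values.

```python
from collections import defaultdict
from itertools import combinations


def implicit_links(object_descriptors, object_to_record):
    """Find pairs of abstract/full-text records sharing a descriptor value."""
    records_by_value = defaultdict(set)
    for obj, descriptors in object_descriptors.items():
        for value in descriptors:
            records_by_value[value].add(object_to_record[obj])
    links = defaultdict(set)
    for value, records in records_by_value.items():
        # Only values that span two or more records create a link.
        for first, second in combinations(sorted(records), 2):
            links[(first, second)].add(value)
    return dict(links)


object_descriptors = {
    "Object1": {"Salinity"},
    "Object2": {"Placeholder term"},  # placeholder, not taken from FIG. 10
    "Object3": {"Salinity"},
}
object_to_record = {
    "Object1": "Abstract/Full-text1",
    "Object2": "Abstract/Full-text1",
    "Object3": "Abstract/Full-text2",
}
links = implicit_links(object_descriptors, object_to_record)
```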

Exemplary Uses of a Captioned Objects-Enhanced Index in Information Discovery

Retrospective Searching

FIGS. 11A-11E illustrate an exemplary search user-interface 1100 which may be integrated with Search/Browse Services 680 and Display Services 685 that may be implemented on computer system 300. In general terms, the search interface allows users to:

-   input queries that are matched against stored indexes of both traditional abstract/full-text records and the objects index,
-   view a result set comprising a set of records that matched the specified query,
-   view the full record, and
-   navigate between abstract, full-text and object components.

Search interface 1100 may also comprise a plurality of navigational links and user-interface widgets that facilitate ease-of-use and/or access to ancillary activities important to the research work-flow (for example, saving search results).

According to the illustration depicted in FIG. 11A, the query text-box 1110 allows the user to specify a query (‘light absorption’). Search button 1120, when ‘clicked’, submits this query to a matcher in Search/Browse Services 680. The user may specify that the search be conducted against specific subject areas.

FIG. 11B is an illustration of a search results page 1130 comprising a result set 1132 displayed in a ‘Summary format’. The main display consists of published works (abstracts or full-text) that matched the specified search criteria (query, subject areas and other search parameters). Each result record, such as record 1135, contains display elements by which the user may assess the usefulness of the record to his/her information need without having to view the entire record. According to the illustrated embodiment, these attributes consist of the title, a search-terms-in-context fragment of the abstract text, and descriptors. The descriptors 1136 that have been assigned may be hyperlinked whereby each hyperlink is in essence a pre-constructed query for the displayed descriptor. For example, should the user click the descriptor ‘Mathematical models’, a new search results page would be displayed containing all records that have this descriptor.

In addition to abstract record attributes, the summary view for each abstract may contain additional navigational links. For example, View Record link 1137 associated with each record summary may provide the user access to the associated full-record of the abstract. Similarly ‘Full-Text’ link 1138 may provide access to the print-ready version (e.g., in PDF format) of the article. In other words, when a user selects this link, a request for the article is made to Display Services 685 which, using the parameters supplied in the request, locates the required image data within Image Repository 677 and presents the data to the user.

User interface tab 1140 labeled ‘Tables & Figures’ in FIG. 11B is an objects index search results indicator that conveys to the user the number of object records that matched the specified query, and is also a hyperlink for the user to view the matched objects. According to the embodiment depicted, the user interface transparently performs a search of the objects index without the user explicitly selecting the objects database to be included in the search in search interface 1100. However, it should be evident to those skilled in the art that alternative user interfaces may be constructed where the choice of inclusion of the objects index as a distinct ‘database’ is under the control of the user.

FIG. 11C is an illustration of an objects search results page 1150 displayed to the user when objects search results indicator tab 1140 is selected or clicked. Objects results set 1152 comprises a list of object records that matched the query. As with abstract summary display 1132, object summary record 1155 contains display elements by which the user may assess the usefulness of the record to his/her information need. According to the illustration, objects summary results display 1155 may consist of the caption text, a thumbnail image of the object, and its publication source and assigned descriptors 1156, which as with the abstract summary display may be hyperlinked to provide access to objects with the selected descriptor.

Furthermore, the summary display may contain additional navigational links to facilitate additional or ‘detailed’ access to the specific record. For example, the thumbnail image may be hyperlinked to a full-image view of the specific object. According to a preferred embodiment, the full-image of the object is provided to the user by means of a ‘pop-up’ window. In another embodiment, the object may be placed in a user-controlled dynamically resizable output area where the image expands or shrinks depending upon the size of the output area. Similarly, View Record link 1157 may provide access to the full contents of the objects record 1155.

FIG. 11D is an illustration of an object record view 1160 displayed when the user clicks View Record link 1157. This display comprises the full complement of object attributes captured, indexed, assigned and stored by the objects processing framework. View Abstract link 1162 provides access to the associated abstract record attributes of the specific object. Similarly, Full-text link 1163 may provide access to the print-ready version of the article from which the specific object was extracted and indexed.

FIG. 11E is an illustration of the abstract record view associated with object record 1155. Tables & Figures attribute 1165 contains thumbnail images of the objects associated with this abstract. The image of object record 1155 is thumbnailed as FIG. 1. These images may be hyperlinked to their corresponding object record views such as object record view 1160 for FIG. 1. Thus the user is able to seamlessly navigate between objects and abstracts records bi-directionally, i.e., from abstracts to objects and vice versa.

FIGS. 14A-14E illustrate another exemplary search user-interface 1400, which may also be integrated with Search/Browse Services 680 and Display Services 685 that may be implemented on computer system 300. Generally, the search user-interface 1400 allows users to perform the same functions as search user-interface 1100.

As shown in FIG. 14A, the query text-box 1400 allows the user to specify a search query (again, ‘light absorption’). Search button 1420, when ‘clicked,’ submits the entered query to a matcher in Search/Browse Services 680. The user may specify that the search be conducted against specific subject areas (here CSA Illumina Natural Sciences and Environmental Sciences and Pollution Mgmt databases) or in a specified date range. One of ordinary skill in the art would recognize that there are a number of categories by which a search could be restricted.

FIG. 14B is an illustration of an objects search results page 1430 (similar to that of FIG. 11C). The objects search results page 1430 includes an objects search results set 1431, which is also displayed in a “summary format.” The summary objects search results set 1431 includes tabs for Published Works 1432 (abstracts or full-text), Tables & Figures 1433, and Web Sites 1434 that matched the entered search query (in FIG. 14B, the Tables & Figures tab 1433 is the active tab). Each object result record, such as object record 1435, contains display elements through which the user may gain a quick understanding of the general subject matter and usefulness of the object record without having to view the entire record. In this embodiment, the summary of the record 1435 contains a title of the object, here “FIG. 3. Profiles of . . . ”; a thumbnail of the object, here a graph; the title of the article in which the object appears, here “Photosynthesis within isobilateral eucalyptus leaves”; the authors of the article, here Evans and Vogelman; and the title, page numbers, and date of the publication in which the object and article appear. On the right-hand side of the objects search results page 1430, the object record summary 1435 also indicates the database in which the object appears, here “CSA Illumina Natural Sciences”; and the Object Descriptors, here Depth, Monochromatic light, and Relative absorptance (note that light is italicized because the word light was part of the search query). In this embodiment, the Object Descriptors 1436 have been hyperlinked so that when the user clicks on a hyperlink, e.g., Depth, a new search results page is displayed containing all object records having this Object Descriptor.

Object summary record 1435 also contains additional navigational links, such as View Record 1437, View Abstract 1438, Full-Text Linking 1439, Link to Holdings, InterLibrary Loan, and Documents Delivery. In this embodiment, the View Record link 1437 associated with each record summary provides the user access to the associated full-record of the object as shown in FIG. 14C. The View Abstract link 1438 provides access to an enhanced abstract, which is shown for object summary record 1435 in FIG. 14D. The Full-Text link 1439 may provide access to the full article or a print-ready version (e.g., in PDF format) of the article containing the object. In other words, when a user selects this link, a request for the article is made to Display Services 685 which, using the parameters supplied in the request, locates the required image data within Image Repository 677 and presents the data to the user.

FIG. 14C is another illustration of an object record view 1450, which is displayed when the user clicks the View Record link 1437 in object summary record 1435. The object record view also contains navigational links, which allow the user to quickly access the Abstract record and the Full-Text as described above. This object record view 1450 also contains the attributes regarding the object record captured, indexed, assigned, and stored by the object processing framework. For example, the object record view indicates from which Database the object comes; the Image File 1451 (with a link to the original image); the object Caption 1452, here “FIG. 3. Profiles of . . . ”; the Category 1453 of the object, here “Figure, Branch, and ScatterPlot”; the title, author, and source of the article in which the object appears; and the Object Descriptors 1454 assigned to the object. By clicking on each of the hyperlinks in Category 1453, e.g., Figure, a new search result will be provided containing all objects that are categorized as a Figure.

In this embodiment, the object record view 1450 also contains a publisher attribution section 1455. Here, the object record view 1450 also displays the publisher's name 1456, here Blackwell Publishing Ltd.; the Digital Object Identifier (DOI) 1456, a type of identifier well understood in the publishing industry; an Object DOI 1457; the publication year of the object and associated article and source; the ISSN, or International Standard Serial Number, which is a unique eight-digit number used to identify a print or electronic periodical publication; and accession numbers. The publisher attribution section 1455 provides users with information regarding the publisher so that the user is aware of the publisher and likely holder of the copyright on the object and full-text article.

FIG. 14D contains an enhanced abstract 1460 for the article containing the object 1435. The enhanced abstract 1460 provides a great deal of useful information in summary format to aid researchers and other users in more efficiently conducting research. Again, the enhanced abstract 1460 provides the user with the name of the database 1461 where the article is located, here CSA Illumina Natural Sciences. The enhanced abstract 1460 provides the Title 1462 of the article, here “Photosynthesis within isobilateral Eucalyptus pauciflora leaves.” The enhanced abstract 1460 also provides the names of the authors 1463 and their affiliations 1464, e.g., where an author is employed, teaches or is affiliated. The enhanced abstract 1460 provides the source 1465 of the article containing the object 1435. The enhanced abstract 1460 details some interesting notes 1466 about the article, e.g., the number of figures, tables, formulas, and references appearing in the article. The enhanced abstract 1460 also contains thumbnails of all the objects 1467 appearing in the article.

When a user holds a cursor over an object 1467 (e.g., FIG. 1 in enhanced abstract 1460), an information balloon 1490 shown in FIG. 14E appears providing the user with the caption 1491 of the object; the Category 1492 of the object; and the Object Descriptors 1493. The Category 1492 and Object Descriptors 1493 are hyperlinked so that the user can search by clicking the hyperlinks to receive the results as described above.

The enhanced abstract 1460 of FIG. 14D also contains a standard abstract 1468. As compared to the abstract record and enhanced abstract 1460, abstract 1468 is a brief summary of a research article that is often used to help a reader quickly ascertain the article's purpose (an abstract almost always appears at the beginning of an article to act as the point-of-entry for a given article).

Enhanced abstract 1460 also contains a listing of all the object descriptors 1469 that have been assigned to the objects appearing in the article. Each of the object descriptors has an empty check-box, which allows the user to check the box if the user wishes to conduct another search using the checked terms. The enhanced abstract 1460 allows the user to run this additional search using the checked Object Descriptors with an “and” logic or an “or” logic by checking a box; but one of ordinary skill in the art would understand that any search logic could be implemented.
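
The “and” and “or” logic for checked descriptors might be implemented along these lines. The `search_by_descriptors` helper and the sample object records are illustrative assumptions; only the descriptor names Depth, Monochromatic light and Relative absorptance come from the example above.

```python
def search_by_descriptors(objects, checked, logic="and"):
    """Return ids of objects matching the checked descriptors."""
    if logic not in ("and", "or"):
        raise ValueError("logic must be 'and' or 'or'")
    wanted = {term.lower() for term in checked}
    hits = []
    for obj_id, descriptors in objects.items():
        have = {d.lower() for d in descriptors}
        if logic == "and":
            matched = wanted <= have      # every checked term is present
        else:
            matched = bool(wanted & have)  # at least one checked term is present
        if matched:
            hits.append(obj_id)
    return hits


# Made-up object records keyed by hypothetical identifiers.
objects = {
    "obj-a": ["Depth", "Monochromatic light", "Relative absorptance"],
    "obj-b": ["Depth"],
    "obj-c": ["Humidity"],
}
checked = ["Depth", "Monochromatic light"]
and_hits = search_by_descriptors(objects, checked, logic="and")
or_hits = search_by_descriptors(objects, checked, logic="or")
```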

The enhanced abstract 1460 also contains publisher attribution information 1475, which provides much of the same information that was provided by the publisher attribution information in the object record view of FIG. 14C. In addition, the publisher attribution information 1475 of the enhanced abstract 1460 provides the electronic ISSN 1476 of the article; the language 1477 in which the article is written; and the last update 1477 of the article.

Those skilled in the art will recognize that, while the enhanced abstract 1460 is described as containing certain fields, an enhanced abstract according to the present invention could be implemented using more fields, different fields, or fewer fields without departing from the invention.

Those skilled in the art will recognize that the objects enhanced extraction and indexing may also be incorporated into other search-based work flow applications such as an alerting service whereby newly added objects are matched against a database of stored queries and users are proactively notified (e.g., via email) about any objects that match their stored queries.
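
A bare-bones version of such an alerting step is sketched below. The `match_alerts` helper, the representation of a stored query as a list of required descriptors, and the sample users and objects are all assumptions; delivery of the notification (e.g., by email) is left out.

```python
def match_alerts(new_objects, stored_queries):
    """Map each user to the newly added objects matching their stored query."""
    notifications = {}
    for user, terms in stored_queries.items():
        wanted = {term.lower() for term in terms}
        hits = [
            obj["object_id"]
            for obj in new_objects
            if wanted <= {d.lower() for d in obj["descriptors"]}
        ]
        if hits:  # only users with at least one match are notified
            notifications[user] = hits
    return notifications


# Made-up stored queries and newly indexed objects.
stored_queries = {
    "user-1": ["salinity"],
    "user-2": ["vitamin E", "antioxidants"],
}
new_objects = [
    {"object_id": "O10", "descriptors": ["Salinity", "Lakes"]},
    {"object_id": "O11", "descriptors": ["Vitamin E"]},
]
notifications = match_alerts(new_objects, stored_queries)
```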

Captioned Objects Browsing

FIG. 12A is an exemplary graphical user-interface 1200 that embodies a novel information discovery technique according to one aspect of the present invention. Specifically, the interface depicted allows a user to specify an objects search criterion and then browse or traverse the indexed linkages using an arbitrary object as the starting point for the traversal.

Criteria selection area 1205 comprises user-interface widgets to specify an initial sub-set of objects of interest, based upon attributes of the object records in the index. According to the illustration depicted, a Category checkbox list may be presented for the user to indicate the type of objects to be included, along with a geographic area or Country drop-down list and a check-box to indicate the nature of the statistical analysis performed. According to the illustration depicted, the user has chosen to retrieve all objects that are of type “Graph”. When the user presses search button 1210, all objects that satisfy the selection criteria are retrieved. Drop-down box 1220 is populated with the list of unique primary variables associated with the records in the search result set. Simultaneously, drop-down box 1225 is populated with thumbnail images of the objects that match the specified search criteria. These thumbnail images may be hyperlinked to provide access to a full-size image or alternatively a full record view of the object.

After viewing the initial results, the user may select specific primary variables of interest by clicking on the text labels listed in drop-down box 1220. When the user indicates a specific primary variable (‘atmospheric CO’), the user-interface is refreshed simultaneously in Results drop-down 1225 and Primary Link drop-down box 1230. Results drop-down box 1225 now contains only those objects which have the selected primary variable ‘atmospheric CO’. Primary link drop-down box 1230 is populated with the variables that are directly associated with the selected primary variable. According to the illustration depicted, at this point, the result set contains graph objects that associate ‘atmospheric CO’ to ‘air temperature’, ‘Altitude’, ‘cloud optical thickness’, ‘humidity’ and ‘ozone concentration’.

To navigate to the second-level associations, the user may indicate specific variables of interest from Primary Link drop-down box 1230. According to the illustration depicted, the user selects ‘Altitude’ and ‘ozone concentration’. Upon making these selections, a search (according to the same criteria as originally specified by the user) is conducted to retrieve all objects that are associated with these variables. Secondary Link drop-down box 1240 is populated with variables associated with the selected primary link variables. Simultaneously, hyperlinked thumbnail images of the objects are presented in Secondary Results box 1250. The user may then further filter the result by selecting a specific secondary link of interest. According to the illustration depicted, the user selects ‘nitrogen oxide’, resulting in Secondary Results box 1250 being refreshed with thumbnail images of only those objects that meet this selection criterion (1260).
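The browse sequence of FIG. 12A may be sketched in code as follows. This is a minimal illustration only, not the disclosed implementation: the record layout (`ObjectRecord`), the helper names (`search`, `objects_with`, `linked_variables`) and the sample records in `SAMPLE_INDEX` are all hypothetical, and "association" is assumed here to mean that two variables appear in the same indexed object.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ObjectRecord:
    """Hypothetical index record for one captioned object."""
    object_id: str
    category: str          # e.g. "Graph", "Table"
    variables: frozenset   # variables shown in the object


def search(index, category):
    """Initial criteria search: keep records of the requested category."""
    return [r for r in index if r.category == category]


def objects_with(records, variables):
    """Records that contain at least one of the given variables."""
    wanted = set(variables)
    return [r for r in records if r.variables & wanted]


def linked_variables(records, selected, exclude=()):
    """Variables that share an object with any selected variable."""
    selected = set(selected)
    linked = set()
    for r in records:
        if r.variables & selected:
            linked |= r.variables - selected
    return sorted(linked - set(exclude))


# Hypothetical sample records, for illustration only.
SAMPLE_INDEX = [
    ObjectRecord("obj-1", "Graph", frozenset({"atmospheric CO", "Altitude"})),
    ObjectRecord("obj-2", "Graph", frozenset({"atmospheric CO", "ozone concentration"})),
    ObjectRecord("obj-3", "Graph", frozenset({"Altitude", "ozone concentration", "nitrogen oxide"})),
    ObjectRecord("obj-4", "Table", frozenset({"atmospheric CO", "humidity"})),
]

if __name__ == "__main__":
    graphs = search(SAMPLE_INDEX, "Graph")                     # search button
    primary = linked_variables(graphs, ["atmospheric CO"])     # primary links
    secondary = linked_variables(graphs, primary, exclude=["atmospheric CO"])
    hits = objects_with(objects_with(graphs, primary), ["nitrogen oxide"])
    print(primary, secondary, [r.object_id for r in hits])
```

Each selection narrows or extends the result set using only the variables recorded for each object, so the same helpers serve the primary and the secondary link levels.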

FIG. 12B is an illustration of the full-image view of hyperlinked thumbnail image 1260 and is a graph object showing the relationship between the user-selected primary links (‘altitude’ and ‘ozone concentration’) and secondary link ‘nitrogen oxide’. By browsing the linkages between objects, the user is thus able to discover a potential relationship between the original variable of interest, ‘atmospheric CO’, and an indirectly linked variable, ‘nitrogen oxide’.

In summary, indexing captioned objects can be immensely valuable to a researcher interested in linking variables within or across disciplines. For example:

1) Acutely-targeted publication searches can be crafted by employing object-oriented searches rather than traditional article-level searches.

2) Researchers can find tables and figures containing specific variables, ensuring that the study actually focused on that variable, rather than simply referring to it indirectly (i.e., from another publication).

Example: A Google Scholar™ search, or a search using other search engines, for a time series of sea surface height off the Galapagos may retrieve many publications that do not actually contain data on sea surface height off the islands. (In fact, many of the results may stem from a match in the cited references and not the actual article.) Similarly, a traditional A&I database search would not guarantee a result list of articles containing the required quantitative information. However, results from a captioned objects index, constructed in accordance with embodiments of the disclosed invention, would include records for objects that actually contain those data.

3) Categories of objects can be easily browsed (e.g., all photomicrographs of bacteria; all graphs containing a particular variable; all tables listing a specific element; etc.). Making visuals for conference presentations or seminars can be greatly facilitated.

4) Spurious correlates can be identified by linking dependent variables through a series of independent variables. For example, a dependence of lobster population density on sediment grain size found in one study may actually be a dependence on bottom current speed, the controlling factor of grain size elucidated in another study that had nothing to do with lobsters and was therefore not ‘on the radar’ of the lobster researcher.

Another example: Consider two lines of research on Maximum Sustainable Yield (MSY) in marine fisheries, one in Fisheries Oceanography and the other in Sociology. Each study develops a predictive MSY model: one based on sea surface temperature (the oceanographer) and the other on landing statistics in the context of fishermen ethics (the sociologist). Both avenues of research would benefit from the ability to easily link a specific variable to all other independent variables in many subject areas. Indexing captioned objects does not simply help answer research questions; rather, in conjunction with an objects-capable computer user interface, it provides a unique tool with which researchers can pose questions for future research.
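The linking of a dependent variable through a series of intermediate variables, as in the lobster example above, may be sketched as a shortest-path search over variables that co-occur in indexed objects. The sketch below is illustrative only and rests on assumptions not stated in the disclosure: the object identifiers and variable sets in `SAMPLE_OBJECTS` are hypothetical, two variables are treated as linked when they appear in the same object, and breadth-first search is one possible traversal among many.

```python
from collections import deque


def build_variable_graph(objects):
    """Map each variable to the variables it shares an object with.

    graph[a][b] lists the ids of the objects that show a and b together.
    """
    graph = {}
    for object_id, variables in objects.items():
        for a in variables:
            for b in variables:
                if a != b:
                    graph.setdefault(a, {}).setdefault(b, []).append(object_id)
    return graph


def find_chain(graph, start, goal):
    """Breadth-first search for the shortest chain of linked variables."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in sorted(graph.get(path[-1], {})):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no chain of shared objects connects the two variables


def chain_evidence(graph, path):
    """For each link in the chain, the objects that support it."""
    return [graph[a][b] for a, b in zip(path, path[1:])]


# Hypothetical objects from two unrelated studies, for illustration only.
SAMPLE_OBJECTS = {
    "study-A-fig-2": {"lobster population density", "sediment grain size"},
    "study-B-fig-5": {"sediment grain size", "bottom current speed"},
}

if __name__ == "__main__":
    graph = build_variable_graph(SAMPLE_OBJECTS)
    chain = find_chain(graph, "lobster population density", "bottom current speed")
    print(chain)
    print(chain_evidence(graph, chain))
```

The chain returned is only a candidate for further study; the supporting objects identify which figures or tables the researcher would need to inspect to judge whether the indirect link is meaningful.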

Exemplary Use Cases

FIGS. 13A-13H accompany exemplary use cases for embodiments of the present invention. These use cases involve oceanography specifically but provide exemplary evidence, in general, of the usefulness and advantages of indexing and linking nontextual information available from articles.

ADVANTAGE 1: Identifying data from unlikely sources.

One advantage provided is that such a system enhances the ability to identify data from unlikely sources. Physical oceanographers often require hydrographic information for their ocean current models, yet their own data are often restricted to narrow cruise tracks. The ability to broaden their models to include areas where they did not sample is contingent on identifying other studies which may contain the data. These data may be hidden in the traditional article-level indexing because data in a specific figure or table may not be reflected in the title or summary. A full-text search would identify hundreds of irrelevant publications which may mention a specific variable but not contain corresponding data.

Specifically, temperature/salinity or “T/S” diagrams, such as those in FIG. 13A, are invaluable to physical oceanographers. These graphs are from “Bacterial abundance and production and heterotrophic nanoflagellate abundance in subarctic coastal waters (western North Pacific Ocean)”, Aquatic Microbial Ecology, 23(3) 2001, 263-271. Thus, FIG. 13A would be quickly identified in an object database even though the context of the research is biological rather than physical, as evinced by the article and journal title.

ADVANTAGE 2: The use of an indexed object database also simplifies the ability to identify spurious factors.

Example: One might assume that the growth of microscopic algae (i.e., “primary production”) in the Gulf of Alaska is limited by the amount of available nutrients (e.g., Nitrogen concentration, either as nitrate or nitrite).

How can the assumption be tested if there are measurements of primary production at a study site but no corresponding nitrogen data? A quick search of the objects database may identify a publication containing the nitrogen data for the study site, as shown in FIG. 13B.

This allows plotting of the primary production data against these values of nitrogen to determine if there is a possible correlation. It is possible, however, that even if a correlation exists, the factor controlling primary production may not be nitrogen, but some other variable that controls nitrogen distribution. Again, a search of the object database for variables linked to nitrogen might reveal the information of FIG. 13C.
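As one way of checking for such a possible correlation, a Pearson correlation coefficient could be computed between the researcher's primary production measurements and nitrogen values recovered from a retrieved object. The sketch below is an assumption made for illustration; the disclosure does not prescribe any particular statistic, and the numbers shown are placeholders rather than data from any figure.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length, at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant sample has no defined correlation")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Placeholder values only: nitrogen read from a retrieved object and
    # primary production measured at the same stations by the researcher.
    nitrogen = [1.0, 2.0, 3.0, 4.0]
    production = [2.0, 4.0, 6.0, 8.0]
    print(pearson(nitrogen, production))
```

A coefficient near +1 or -1 would only indicate that the two variables vary together; as the surrounding text notes, it would not show which variable is the controlling factor.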

Discovery of secondary or spurious correlates: The graph of FIG. 13C suggests that other variables may be important to primary production. Nitrogen concentration may be dependent on salinity, and if so, perhaps primary production is linked to salinity and not to nitrogen concentration; i.e., nitrogen concentration is a spurious correlate.

Why would salinity be important to primary production? If a search for variables linked to salinity identifies the table of FIG. 13D, then a relationship between salinity and turbidity can be shown. Because turbidity is a proxy variable for light attenuation, perhaps light controls primary production? Thus, a conclusion may be reached that perhaps more research on turbidity and primary production is warranted.

ADVANTAGE 3: Ability to identify new avenues of research.

Starting with the realization that sea scallop density on Georges Bank is concentrated on the northern flank (see FIG. 13E), the question remains: why is the density so high here, and not towards the central bank where primary production is higher?

A quick search for maps of Georges Bank in the objects database may reveal several variables which have similar patterns to scallop density. For example, FIG. 13F shows that scallops are concentrated in a gravel area.

Why would scallops prefer to settle on gravel rather than mud or sand (where food is more plentiful)? Perhaps there is a secondary factor: What variables may be linked to the sediment size distribution? Another search of the object database may locate a figure or graph that shows that grain size is related to current velocity, as does FIG. 13G.

Perhaps current velocity is more important to scallops than substrate size. A search of the object database may allow evidence to be found that supports the hypothesis that current velocity on Georges Bank varies in the same manner as scallop distribution. For example, FIG. 13H shows the M2 residual currents on Georges Bank. Clearly, scallops are abundant where currents are high. But what variables are linked to current speed that may be important to scallops? In areas of high currents, suspended silt concentration is extremely low. A search for suspended silt concentration in the object database may find that silt lowers the ability of scallops to feed (i.e., relative crawl velocity of ciliary sections is lower). The distribution of scallops, therefore, may reflect increased mortality of scallops in low flow areas. Perhaps this possibility identifies an area for further research.

Conclusion

A number of variations to the specific behaviors and steps described in the above examples may be made without departing from the scope of the present invention. The various illustrative logical blocks, modules, circuits, elements, and/or components described in connection with the embodiments disclosed herein may be implemented or performed with a general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic component, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general-purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing components, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.

The methods or algorithms described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. A storage medium may be coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor.

The previous description is provided to enable any person skilled in the art to practice the various embodiments described herein. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments. Thus, the claims are not intended to be limited to the embodiments shown herein, but are to be accorded the full scope consistent with the language of the claims, wherein reference to an element in the singular is not intended to mean “one and only one” unless specifically so stated, but rather “one or more.” All structural and functional equivalents to the elements of the various embodiments described throughout this disclosure that are known or later come to be known to those of ordinary skill in the art are expressly incorporated herein by reference and are intended to be encompassed by the claims. Moreover, nothing disclosed herein is intended to be dedicated to the public regardless of whether such disclosure is explicitly recited in the claims. No claim element is to be construed under the provisions of 35 U.S.C. §112, sixth paragraph, unless the element is expressly recited using the phrase “means for” or, in the case of a method claim, the element is recited using the phrase “step for.”

1-68. (canceled)
69. A system for processing information from a plurality of objects contained in a plurality of documents, the system comprising: an objects content processing system including a processor to extract data from the plurality of objects; an image repository system including computer readable media that stores object images and images of the plurality of documents; and an index that stores data extracted from the plurality of objects, associations between the plurality of objects, and index descriptors assigned to each of the plurality of objects.
70. A system as claimed in claim 1, further comprising a first interface that receives queries to search the index for extracted data responsive to each of the queries.
71. A system as claimed in claim 2, further comprising a second interface that displays objects and objects in response to a request.
72. A system as claimed in claim 3, wherein the objects content processing system further includes a computer program that links each of the plurality of objects with a respective one of the plurality of documents.
73. A system as claimed in claim 4, wherein the objects content processing system further includes a user interface for accessing data extracted from the plurality of objects and links between the plurality of objects with a respective one of the plurality of documents.
74. A system as claimed in claim 5, wherein the user interface is adapted to allow for indexing the plurality of objects based on the data extracted from the plurality of objects.
75. A system as claimed in claim 5, wherein the objects content processing system further includes a computer program that indexes the plurality of objects based on data extracted from the plurality of objects.
76. A method of identifying information in a database responsive to a query from a user, the database containing information regarding a plurality of documents, at least some of the plurality of documents containing objects, containing extracted information regarding the objects, and containing assigned index descriptors relating to the information contained in the objects, the method comprising: receiving the query from the user; accessing the database in response to the query; determining whether any objects are responsive to the query; and transmitting information regarding the responsive objects to the user.